
Final Project

Group 1:
Hiba Awan
Nathania Stephens
Abstract
Introduction & Background
Motivation/ Purpose
In 2023, there were over 30,000 arrests and close to 65,000 citations in Fairfax County. Fairfax County includes areas such as Centreville, Chantilly, Herndon, Reston, Tysons Corner, McLean, Merrifield, George Mason, Annandale, Burke, Springfield, Alexandria, and Lorton, to name a few. If you live, work, or study in these areas, this project should be of interest to you. It aims to inform people in Fairfax County about local crime and to provide statistical insights they can apply.
Goals/ Objectives
To provide relevant and insightful crime information, several visualization methods were applied to make the data easy to interpret and compare. Statistical learning techniques were used to identify statistically significant factors and associations between variables. Since the data used for this project is largely categorical, the project focuses on techniques such as the Chi-Square Test, Logistic Regression, Decision Trees, and Random Forest.
Data
Overview
About the Data
Three datasets were pulled from the Fairfax County Police Department website. They cover arrests, citations, and warnings in the year 2023. For simplicity, general definitions are provided:
Arrest - When a person is taken into custody to answer for an offense or when there is a deprivation or restraint of a person’s liberty in any significant way.
Citation - Formal notice issued by a law enforcement officer for a violation of law, typically related to traffic laws or other minor offenses. It typically requires the violator to appear in court or pay a fine.
Warning - When a violation, typically minor, has occurred but the officer issues a warning rather than a citation.
The datasets included between 24 and 34 variables, but many of the variables were redundant or not applicable to the research (e.g. web_address, phone_number, name). The following attributes were key to the research conducted:
| Column Name | Data Type | Description |
|---|---|---|
| Date | Date | Date of Violation |
| Time | Chr | Time of Violation |
| Offense | Chr | Description of Violation |
| Gender | Chr | Gender of Violator |
| Ethnicity | Chr | Hispanic or Non-Hispanic |
| District | Chr | Administrative area |
| Latitude | Dbl | Coordinate measuring north/south of the equator |
| Longitude | Dbl | Coordinate measuring east/west of the prime meridian |
| Outcome | Chr | Result of violation: arrest, citation, or warning |
Limitations and Assumptions
Due to the nature of the data available on the Fairfax County Police Department (FCPD) website, analysis was limited to techniques for qualitative (categorical) data. The approach taken for the project focused on classification, that is, predicting a qualitative response. This means that each record pulled from FCPD is assigned to a category or class.
While understanding local crime is the goal of this project, the data acquired only accounts for crime that was recorded by FCPD. It does not account for crimes that went unreported or that were reported through channels other than FCPD.
Cleaning and Transformation
To address questions related to gender, the data needed to be standardized and correctly categorized. Column names needed to be consistent across the three datasets before they could be merged, so Gender was used in place of Sex. Next, the column values were transformed to consistent labels, e.g. Male, Female, and Other/Unknown. The total proportion of each Gender class was examined to verify that the Other/Unknown class could be removed without…
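The label standardization described above can be sketched in a few lines of Python. This is a minimal illustration, not the report's own code: the raw spellings in `GENDER_LABELS`, the helper names `normalize_gender` and `class_proportions`, and the toy `sample` values are all assumptions, since the report does not list the raw values found in the FCPD files.

```python
from collections import Counter

# Hypothetical raw spellings; the actual FCPD values are not listed in the report.
GENDER_LABELS = {
    "m": "Male",
    "male": "Male",
    "f": "Female",
    "female": "Female",
}


def normalize_gender(raw):
    """Map a raw gender/sex value to Male, Female, or Other/Unknown."""
    if raw is None:
        return "Other/Unknown"
    return GENDER_LABELS.get(str(raw).strip().lower(), "Other/Unknown")


def class_proportions(values):
    """Return each normalized label's share of all records."""
    counts = Counter(normalize_gender(v) for v in values)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy records, not FCPD data.
sample = ["M", "male", "F", "Female ", "U", None, "M", "F"]
proportions = class_proportions(sample)
print(proportions)
```

Checking the share of the Other/Unknown class in this way is one simple means of judging how much data would be lost if that class were dropped.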
Research Questions
Is there an association between gender and warnings?
Does Time of Day or Day of the Month Factor into Number of Citations?
Research & Analysis
Question 1: Is there an association between gender and warnings?
To address this question, the null and alternative hypotheses are established.
Null Hypothesis: There is no association between gender and violation outcome (warning or citation). This would mean that the likelihood of a violator getting a warning is independent of gender.
Alternative Hypothesis: There is an association between gender and violation outcome. This would mean that whether a violator is given a citation or a warning is related to gender.
The cleaned and combined dataset for warnings and citations contains a total of 88,320 records. The counts for each outcome (citation or warning) show that FCPD issued far more citations than warnings. The stacked bar chart also shows that males have a higher count in both categories.
Next, the warning rate for each gender is calculated. This is the probability that a male or female violator receives a warning instead of a citation, i.e. gets out of a ticket. To calculate the warning rate, the number of warnings is divided by the total number of incidents.
\[ \text{Warning Rate} = \frac{\text{Number of Warnings}}{\text{Total Incidents (Warnings + Citations)}} \]
This shows a slight difference in proportion between the two genders, with females having a higher warning rate than males (about 30.0% versus 26.1%, based on the counts in the contingency table below). In other words, a female violator was more likely to receive a warning than a male violator, even though males received more warnings in absolute terms. Is this difference significant, or is it a result of chance or other factors? To help answer this, the Chi-Square Test of Independence is used. It helps determine whether the variables gender and outcome are independent or whether there is a relationship between them.
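As a quick check, the warning rates can be reproduced from the counts in the report's gender-by-outcome contingency table. The Python sketch below is illustrative only; the function name `warning_rate` and the dictionary layout are assumptions, not the report's own code.

```python
# Counts copied from the report's gender-by-outcome contingency table.
counts = {
    "Female": {"citations": 20478, "warnings": 8777},
    "Male": {"citations": 43657, "warnings": 15408},
}


def warning_rate(warnings, citations):
    """Warnings divided by total incidents (warnings + citations)."""
    return warnings / (warnings + citations)


rates = {g: warning_rate(c["warnings"], c["citations"]) for g, c in counts.items()}
for gender, rate in rates.items():
    print(f"{gender}: {rate:.1%}")
```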

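\[ \chi^2 = \sum \frac{(O-E)^2}{E} \]
Here O is the observed count in each cell and E is the count expected in that cell if gender and outcome were independent. To implement the Chi-Square Test, a contingency table is generated, which shows the distribution of gender and outcome.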
| Gender | Citations | Warnings |
|---|---|---|
| Female | 20,478 | 8,777 |
| Male | 43,657 | 15,408 |
The results of the Chi-Square test show:
- Chi-Square Statistic (X-squared): The chi-square test statistic is 150.62. It measures the discrepancy between the observed frequencies of citations and warnings and the frequencies expected if there were no association between gender and outcome. The table below compares the two.
| Gender | Expected Citations | Observed Citations | Expected Warnings | Observed Warnings |
|---|---|---|---|---|
| Female | 21,244 | 20,478 | 8,011 | 8,777 |
| Male | 42,891 | 43,657 | 16,174 | 15,408 |
- Degrees of Freedom (df): The degrees of freedom for this test is 1, calculated as (number of rows - 1) × (number of columns - 1) = (2 - 1) × (2 - 1).
- p-value: The p-value is less than 2.2e-16, which is far below the 0.05 significance level. It represents the probability of observing a chi-square statistic of 150.62 or larger if the null hypothesis were true.
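The test can be recomputed from the contingency table with the standard library alone. The Python sketch below is a minimal illustration rather than the report's own code; the helper names are assumptions. One detail worth noting: the reported value of 150.62 matches the statistic with Yates' continuity correction applied to the 2×2 table, while the uncorrected Pearson formula gives about 150.82 on the same counts. For one degree of freedom the upper-tail probability can be written with the complementary error function.

```python
import math

# Observed counts from the report's contingency table: rows = gender, columns = outcome.
observed = {
    "Female": {"Citations": 20478, "Warnings": 8777},
    "Male": {"Citations": 43657, "Warnings": 15408},
}


def expected_counts(table):
    """Expected count per cell under independence: row total * column total / N."""
    row_totals = {r: sum(cols.values()) for r, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for c, n in cols.items():
            col_totals[c] = col_totals.get(c, 0) + n
    grand_total = sum(row_totals.values())
    return {
        r: {c: row_totals[r] * col_totals[c] / grand_total for c in cols}
        for r, cols in table.items()
    }


def chi_square(table, yates=False):
    """Pearson chi-square statistic; yates=True subtracts 0.5 from each |O - E|."""
    expected = expected_counts(table)
    stat = 0.0
    for r, cols in table.items():
        for c, obs in cols.items():
            diff = abs(obs - expected[r][c])
            if yates:
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected[r][c]
    return stat


def p_value_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


expected = expected_counts(observed)
stat_plain = chi_square(observed)
stat_yates = chi_square(observed, yates=True)
df = (len(observed) - 1) * (len(observed["Female"]) - 1)
p = p_value_df1(stat_yates)
print(f"uncorrected={stat_plain:.2f} corrected={stat_yates:.2f} df={df} p={p:.3g}")
```

Either version of the statistic leads to the same conclusion here, because both p-values are far below 0.05.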
To visualize how much each cell of the table above contributes to the chi-square statistic, a heatmap is generated. It shows at a glance which cells account for the largest share of the statistic.
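The values behind such a heatmap can be sketched as follows: each cell contributes (O - E)²/E, and dividing by the sum gives its percentage share. This Python sketch uses the uncorrected statistic and a hypothetical helper name, `contribution_percentages`; the report's own heatmap may have been computed differently.

```python
# Observed counts from the report's contingency table.
observed = {
    ("Female", "Citations"): 20478,
    ("Female", "Warnings"): 8777,
    ("Male", "Citations"): 43657,
    ("Male", "Warnings"): 15408,
}


def contribution_percentages(cells):
    """Percent of the (uncorrected) chi-square statistic contributed by each cell."""
    total = sum(cells.values())
    row_totals, col_totals = {}, {}
    for (row, col), n in cells.items():
        row_totals[row] = row_totals.get(row, 0) + n
        col_totals[col] = col_totals.get(col, 0) + n
    contributions = {}
    for (row, col), obs in cells.items():
        exp = row_totals[row] * col_totals[col] / total
        contributions[(row, col)] = (obs - exp) ** 2 / exp
    statistic = sum(contributions.values())
    return {cell: 100.0 * value / statistic for cell, value in contributions.items()}


percentages = contribution_percentages(observed)
for cell, pct in sorted(percentages.items(), key=lambda item: -item[1]):
    print(cell, f"{pct:.1f}%")
```

On these counts the Female/Warnings cell accounts for the largest share, which is expected because it has the smallest expected count while the absolute deviation is the same in all four cells of a 2×2 table.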

Thus the null hypothesis is rejected. The results show a statistically significant association between gender and violation outcome (citation or warning) in Fairfax County. The test establishes an association only; it does not by itself explain why the outcomes differ by gender.
Question 2: Does Time of Day or Day of the Month Factor into Number of Citations?
To examine the likelihood of getting a citation versus a warning (the outcome of a violation), a heatmap is generated to show how this varies by day of the month and hour of the day (24-hour clock). To calculate the citation rate, the following equation is used.
\[ \text{Citation Rate} = \frac{\text{Number of Citations}}{\text{Total Incidents (Warnings + Citations)}} \]
Areas with higher intensity (darker color) represent a higher citation rate. During the hours of 0500-0600, fewer warnings appear to be issued and more citations instead. This window also corresponds to the morning rush.
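The grid behind a heatmap like this can be sketched by grouping records on (day of month, hour) and applying the citation rate formula to each cell. The Python example below is a minimal sketch on toy records, not FCPD data; the timestamp format and the helper name `citation_rate_grid` are assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Toy records, not FCPD data. The "YYYY-MM-DD HH:MM" timestamp format is an assumption.
records = [
    ("2023-03-09 05:15", "Citation"),
    ("2023-03-09 05:40", "Citation"),
    ("2023-03-09 05:55", "Warning"),
    ("2023-03-12 16:05", "Warning"),
    ("2023-03-12 16:30", "Citation"),
    ("2023-04-12 16:45", "Warning"),
]


def citation_rate_grid(rows):
    """Citation rate for each (day of month, hour) cell that has at least one record."""
    citations = defaultdict(int)
    totals = defaultdict(int)
    for timestamp, outcome in rows:
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
        cell = (when.day, when.hour)
        totals[cell] += 1
        if outcome == "Citation":
            citations[cell] += 1
    return {cell: citations[cell] / totals[cell] for cell in totals}


grid = citation_rate_grid(records)
for (day, hour), rate in sorted(grid.items()):
    print(f"day {day:2d}, hour {hour:02d}: citation rate {rate:.2f}")
```

Because cells are keyed by day of month rather than by calendar date, records from different months fall into the same cell, which is what a day-of-month by hour heatmap requires.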
If the interactive heatmap does not display, a static version is shown:

Using Logistic Regression… will update this with cleaner numbers….
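A tibble: 88,320 × 7 (first 10 rows shown; 88,310 more rows)
| Row | BinaryOutcome (fct) | Hour (fct) | DayOfMonth (fct) | DayOfWeek (ord) | District (fct) | Gender (fct) | Race (fct) |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 16 | 12 | Wed | Dranesville | Male | W |
| 2 | 0 | 15 | 13 | Mon | Providence | Male | B |
| 3 | 0 | 15 | 13 | Mon | Providence | Male | B |
| 4 | 0 | 15 | 13 | Mon | Providence | Male | B |
| 5 | 0 | 5 | 9 | Thu | Sully | Female | B |
| 6 | 0 | 0 | 10 | Mon | Franconia | Male | W |
| 7 | 0 | 1 | 10 | Mon | Franconia | Male | A |
| 8 | 0 | 13 | 11 | Tue | Dranesville | Male | B |
| 9 | 0 | 11 | 31 | Tue | Hunter Mill | Male | B |
| 10 | 0 | 11 | 10 | Mon | Hunter Mill | Female | W |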
Call: glm(formula = model_formula_2, family = binomial(link = "logit"), data = data_for_modeling)
Coefficients:
| Term | Estimate | Term | Estimate | Term | Estimate |
|---|---|---|---|---|---|
| (Intercept) | -0.543953 | Hour1 | 0.067872 | Hour2 | 0.138748 |
| Hour3 | 0.322492 | Hour4 | -0.134056 | Hour5 | -0.715503 |
| Hour6 | -0.928240 | Hour7 | -0.183564 | Hour8 | -0.186012 |
| Hour9 | -0.216679 | Hour10 | -0.281275 | Hour11 | -0.374341 |
| Hour12 | -0.299027 | Hour13 | -0.369616 | Hour14 | -0.369130 |
| Hour15 | -0.257313 | Hour16 | -0.357045 | Hour17 | -0.343120 |
| Hour18 | -0.209459 | Hour19 | 0.024723 | Hour20 | 0.208840 |
| Hour21 | 0.263136 | Hour22 | 0.211006 | Hour23 | 0.046055 |
| DayOfMonth2 | -0.017060 | DayOfMonth3 | -0.284725 | DayOfMonth4 | -0.226471 |
| DayOfMonth5 | -0.307410 | DayOfMonth6 | 0.095658 | DayOfMonth7 | 0.016076 |
| DayOfMonth8 | -0.155587 | DayOfMonth9 | -0.280091 | DayOfMonth10 | -0.159569 |
| DayOfMonth11 | -0.285784 | DayOfMonth12 | -0.294318 | DayOfMonth13 | -0.116350 |
| DayOfMonth14 | -0.110178 | DayOfMonth15 | 0.116435 | DayOfMonth16 | 0.008238 |
| DayOfMonth17 | -0.196301 | DayOfMonth18 | -0.244815 | DayOfMonth19 | -0.065312 |
| DayOfMonth20 | -0.188286 | DayOfMonth21 | -0.158645 | DayOfMonth22 | -0.150123 |
| DayOfMonth23 | -0.172025 | DayOfMonth24 | -0.015141 | DayOfMonth25 | -0.373595 |
| DayOfMonth26 | -0.299717 | DayOfMonth27 | -0.205006 | DayOfMonth28 | -0.131540 |
| DayOfMonth29 | -0.315680 | DayOfMonth30 | -0.224334 | DayOfMonth31 | 0.006383 |
| DayOfWeek.L | 0.475198 | DayOfWeek.Q | 0.108194 | DayOfWeek.C | 0.056998 |
| DayOfWeek^4 | -0.125991 | DayOfWeek^5 | -0.038477 | DayOfWeek^6 | -0.074402 |
| DistrictDranesville | 0.182899 | DistrictFranconia | 0.416758 | DistrictHunter Mill | 0.135421 |
| DistrictMason | 0.333657 | DistrictMount Vernon | 0.591765 | DistrictProvidence | 0.168673 |
| DistrictSpringfield | -0.249097 | DistrictSully | -0.196909 | GenderMale | -0.206635 |
| RaceB | 0.141369 | RaceI | 0.003108 | RaceU | -0.325427 |
| RaceW | -0.042346 | | | | |

Degrees of Freedom: 88319 Total (i.e. Null); 88247 Residual
Null Deviance: 103700
Residual Deviance: 99940
AIC: 100100
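The fit summary can be read with a little arithmetic. The Python sketch below uses only the figures printed in the glm output, which R rounds to four significant digits, so the results are approximate. It relies on the fact that for a binary (0/1) response the AIC equals the residual deviance plus twice the number of coefficients. The variable names are illustrative.

```python
# Values as printed in the glm summary above (R rounds deviances to 4 significant digits).
null_deviance = 103700
residual_deviance = 99940
null_df = 88319
residual_df = 88247

deviance_drop = null_deviance - residual_deviance  # reduction attributable to the predictors
extra_parameters = null_df - residual_df           # coefficients beyond the intercept
n_coefficients = extra_parameters + 1              # including the intercept
approx_aic = residual_deviance + 2 * n_coefficients

print(deviance_drop, extra_parameters, approx_aic)
```

The approximate AIC is consistent with the printed value of 100100 once rounding is taken into account, and the count of 73 coefficients matches the number of terms listed in the output.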
30 smallest odds ratios:
| Variable | Odds Ratio |
|---|---|
| Hour6 | 0.3952486 |
| Hour5 | 0.4889459 |
| Hour11 | 0.6877422 |
| DayOfMonth25 | 0.6882558 |
| Hour13 | 0.6909997 |
| Hour14 | 0.6913355 |
| Hour16 | 0.6997413 |
| Hour17 | 0.7095532 |
| RaceU | 0.7222187 |
| DayOfMonth29 | 0.7292931 |
| DayOfMonth5 | 0.7353494 |
| DayOfMonth26 | 0.7410280 |
| Hour12 | 0.7415395 |
| DayOfMonth12 | 0.7450394 |
| DayOfMonth11 | 0.7514248 |
| DayOfMonth3 | 0.7522211 |
| Hour10 | 0.7548205 |
| DayOfMonth9 | 0.7557153 |
| Hour15 | 0.7731266 |
| DistrictSpringfield | 0.7795040 |
| DayOfMonth18 | 0.7828495 |
| DayOfMonth4 | 0.7973425 |
| DayOfMonth30 | 0.7990478 |
| Hour9 | 0.8051887 |
| Hour18 | 0.8110225 |
| GenderMale | 0.8133167 |
| DayOfMonth27 | 0.8146427 |
| DistrictSully | 0.8212650 |
| DayOfMonth17 | 0.8217645 |
| DayOfMonth20 | 0.8283775 |
30 largest odds ratios:
| Variable | Odds Ratio |
|---|---|
| DistrictMount Vernon | 1.8071754 |
| DayOfWeek.L | 1.6083326 |
| DistrictFranconia | 1.5170354 |
| DistrictMason | 1.3960644 |
| Hour3 | 1.3805635 |
| Hour21 | 1.3010033 |
| Hour22 | 1.2349197 |
| Hour20 | 1.2322479 |
| DistrictDranesville | 1.2006927 |
| DistrictProvidence | 1.1837330 |
| RaceB | 1.1518494 |
| Hour2 | 1.1488349 |
| DistrictHunter Mill | 1.1450188 |
| DayOfMonth15 | 1.1234850 |
| DayOfWeek.Q | 1.1142641 |
| DayOfMonth6 | 1.1003830 |
| Hour1 | 1.0702283 |
| DayOfWeek.C | 1.0586540 |
| Hour23 | 1.0471321 |
| Hour19 | 1.0250314 |
| DayOfMonth7 | 1.0162059 |
| DayOfMonth16 | 1.0082723 |
| DayOfMonth31 | 1.0064030 |
| RaceI | 1.0031125 |
| DayOfMonth24 | 0.9849729 |
| DayOfMonth2 | 0.9830843 |
| DayOfWeek^5 | 0.9622539 |
| RaceW | 0.9585381 |
| DayOfMonth19 | 0.9367755 |
| DayOfWeek^6 | 0.9282980 |
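The odds ratios listed above are the exponentials of the logistic regression coefficients. The Python sketch below reproduces a few of them and converts log-odds to probabilities with the inverse logit. The report does not state which outcome is coded as 1 in BinaryOutcome, so the probabilities refer to whichever outcome that is. The intercept is only a rough baseline, because DayOfWeek is an ordered factor and is not anchored to a single reference weekday. The function names are illustrative.

```python
import math

# A few coefficients copied from the glm output in the report.
coefficients = {
    "(Intercept)": -0.543953,
    "Hour6": -0.928240,
    "GenderMale": -0.206635,
    "DistrictMount Vernon": 0.591765,
}


def odds_ratio(coefficient):
    """Multiplicative change in the odds associated with a model term."""
    return math.exp(coefficient)


def probability(log_odds):
    """Inverse logit: convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


odds_ratios = {
    name: odds_ratio(b) for name, b in coefficients.items() if name != "(Intercept)"
}
baseline = probability(coefficients["(Intercept)"])
hour6 = probability(coefficients["(Intercept)"] + coefficients["Hour6"])
print(odds_ratios)
print(f"at the intercept: {baseline:.3f}, with the Hour6 term added: {hour6:.3f}")
```

An odds ratio below 1, such as the one for Hour6, lowers the odds of the outcome coded as 1 relative to the baseline hour, and a value above 1 raises them.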
Additional Questions….
To address each of these questions, exploratory analysis should first be done to gain an understanding and summary of the crime metrics for Fairfax County. This includes understanding which types of crime occurred the most and where.
General crime: The arrest data is mapped to give a geospatial view of where arrests occur.
Next, we look at the top 10 arrest types by Incident-Based Reporting (IBR) code.
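A top-10 ranking like this is a frequency count over the IBR code column. The Python sketch below shows the idea on toy records; the field name `ibr_code`, the helper `top_arrest_types`, and the sample codes are assumptions for illustration and do not come from the FCPD data.

```python
from collections import Counter

# Toy arrest records, not FCPD data; the field name and codes are illustrative only.
arrests = [
    {"ibr_code": "90D"},
    {"ibr_code": "13B"},
    {"ibr_code": "90D"},
    {"ibr_code": "23C"},
    {"ibr_code": "13B"},
    {"ibr_code": "90D"},
]


def top_arrest_types(records, n=10):
    """Return the n most frequent IBR codes with their counts, most frequent first."""
    counts = Counter(record["ibr_code"] for record in records)
    return counts.most_common(n)


top = top_arrest_types(arrests, n=10)
for code, count in top:
    print(code, count)
```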
